Real-world effectiveness of a social-psychological intervention translated from controlled trials to classrooms

Social-psychological interventions have raised the learning and performance of students in rigorous efficacy trials. Yet, after such interventions are distributed “in the wild” for students to self-administer, there has been little research following up on their translational effectiveness. We used cutting-edge educational technology to tailor, scale up, and track a previously validated Strategic Resource Use intervention among 12,065 college students in 14 STEM and Economics classes. Students who self-administered this “Exam Playbook” benefitted by an average of 2.17 percentage points (i.e., a standardized effect size of 0.18), compared with non-users. This effect was 1.65 percentage points when controlling for college entrance exam scores, and 1.75 [−1.88] percentage points for adding [dropping] the Exam Playbook in stratified matching analyses. Average benefits differed in magnitude by the conduciveness of the class climate (including peer norms and incentives), gender, and first-generation status, as well as by how often and how early students used the intervention. These findings on how, when, and by whom these resources are naturally adopted address a need to improve the prediction, translation, and scalability of social-psychological intervention benefits.


Supplementary Table 2
Descriptives of the total number and percentages of (i) students who adopted the Exam Playbook, compared to (ii) students who never used the Exam Playbook; and (iii) students who dropped the Exam Playbook, compared to (iv) students who kept using it. Note. The sample size in Intro Economics in the Winter semester was too small for some analyses, hence the large error bar for this class in the main text Figure 1 forest plot.

Supplementary Note 1 Additional Details on Definition and Operationalization of Exam Playbook "Use"
As described in the main text, we operationalized a "use" of the Exam Playbook to mean accessing and completing the intervention, including: completing the resource checklist, explaining why each resource would be useful, and planning resource use. Students had to click through to the end of the intervention to be counted as having used it.
In the table below, we detail how many instances there were of students starting the Exam Playbook, and how many of those instances were completed. For some classes, such as both Intro Programming classes and Intro Statistics, over 83% of students who started the resource finished it. For the other classes, completion rates were lower, ranging from 30% to 65%. In this paper, we counted only instances where students completed the Exam Playbook as a "use."
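As a minimal sketch of this coding rule (the data frame and column names below are illustrative, not the actual variable names in our dataset):
# Hypothetical session-level log of Exam Playbook accesses (names are illustrative).
playbook_log <- data.frame(
  student_id = c(1, 1, 2, 3),
  exam       = c("E1", "E2", "E1", "E1"),
  started    = c(TRUE, TRUE, TRUE, TRUE),
  completed  = c(TRUE, FALSE, TRUE, FALSE)   # clicked through to the end
)
# A "use" is counted only when the started instance was also completed.
playbook_log$used_playbook <- playbook_log$started & playbook_log$completed
# Completion rate among instances in which the Exam Playbook was started.
completion_rate <- with(playbook_log, sum(completed[started]) / sum(started))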

Description of Selected Resources
While the resource checklists were tailored to the respective classes, we observed that students gravitated to similar types of resources across courses. The table below presents the top 5 resources selected by Exam Playbook users within each class, as well as the proportion of instances in which these resources were selected. From these rankings, we observed the following common types of resources that were consistently popular across classes: 1) practice problems (e.g., problem roulette, practice exams; popular in 12 out of 14 classes); 2) lecture content and notes (e.g., lecture notes, lecture slides; popular in 10 classes); and 3) exam review sessions (popular in 8 classes). Although some courses had unique resources (e.g., videos on Flip It Physics, formula card), these trends suggest that there was, in general, consistency in the types of learning resources that Exam Playbook users tended to choose across classes. Note. The percentage refers to the proportion of times the resource was selected out of the total number of opportunities (i.e., the number of times the Exam Playbook was used).
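As an illustrative sketch of how such percentages can be computed (the data objects and class label below are hypothetical, not the original analysis scripts):
# Hypothetical logs: resource_selections has one row per resource selected in an
# Exam Playbook use; playbook_uses has one row per Exam Playbook use.
sel_counts     <- table(resource_selections$class, resource_selections$resource)
uses_per_class <- table(playbook_uses$class)
# Percentage = times a resource was selected / number of Exam Playbook uses in that class.
pct_selected <- sweep(sel_counts, 1, as.numeric(uses_per_class[rownames(sel_counts)]), "/")
# Top 5 resources for one class (class label is hypothetical).
head(sort(pct_selected["Intro Statistics", ], decreasing = TRUE), 5)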

Difference-in-differences Analysis
An alternative method of assessing the effect of adopting or dropping the Exam Playbook is to use a difference-in-differences (DiD) regression model (Angrist & Pischke, 2008). Here, we report our results using this model and show that it replicates the stratified-matching results reported in the main text.
Similar to our analysis using stratified matching, we restricted our analyses to only the first two exams of each class. To estimate the effect of adopting the Exam Playbook, we took the subset of students who did not use the Exam Playbook on their first exam. For each class, we ran a separate DiD model, controlling for college entrance scores, gender, race, and first-generation status, and aggregated the regression estimates using a random-effects meta-analysis. After students adopted the Exam Playbook, they performed better on the subsequent exam by an average of 2.04 percentage points ([0.81, 3.26], d = 0.16, p = .001). We repeated this analysis to estimate the effect of dropping the Exam Playbook, by taking the subset of students who used the Exam Playbook on their first exam. Controlling for college entrance scores, gender, race, and first-generation status, we estimated that after dropping the Exam Playbook, students performed worse by 1.80 percentage points ([-3.15, -0.44], d = -0.12, p = .009). These estimates were consistent in direction and magnitude with the estimates from our analyses using stratified matching (1.75 percentage points, d = 0.12 for adopting and -1.88 percentage points, d = -0.14 for dropping).
If we exclude Introductory Statistics to test the generalizability of the Exam Playbook, we find that the difference-in-differences analysis still yields a significant positive effect of adoption. Controlling for college entrance scores, gender, race, and first-generation status, students who adopted the Exam Playbook performed better on the subsequent exam by an average of 1.75 percentage points ([0.27, 3.22], d = 0.14, p = 0.021). However, excluding Introductory Statistics, we found that the difference-in-differences effect for dropping the Exam Playbook was not statistically significant at the 0.05 level (b = 0.38 percentage points [-1.59, 2.35], d = 0.03, p = 0.705).
Specifically, these were the models we ran on each class. To estimate the effect of adopting the Exam Playbook, we ran:
lm(exam_score ~ adopted_playbook*time + college_entrance_score + gender + race + first_gen, data = subset(exam_lvl, did_not_use_playbook_on_exam1))
This analysis only includes the students who did not use the Exam Playbook in the first exam. "adopted_playbook" is a dummy-coded variable that indicates the students who started using the Exam Playbook on their second exam.
To estimate the effect of dropping the Exam Playbook, we ran:
lm(exam_score ~ dropped_playbook*time + college_entrance_score + gender + race + first_gen, data = subset(exam_lvl, used_playbook_on_exam1))
This analysis only includes the students who used the Exam Playbook in the first exam. "dropped_playbook" is a dummy-coded variable that indicates the students who dropped the Exam Playbook on their second exam.
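For illustration, a sketch of how the per-class DiD estimates can be pooled with a random-effects meta-analysis, mirroring the meta::metagen call used for the main analyses (object names such as did_fit and all_did_effects are ours, not the original analysis scripts):
library(meta)
did_fit <- lm(exam_score ~ adopted_playbook*time + college_entrance_score +
                gender + race + first_gen,
              data = subset(exam_lvl, did_not_use_playbook_on_exam1))
# The DiD estimate is the adopted_playbook x time interaction coefficient
# (the exact term name depends on how the variables are coded).
did_est <- coef(summary(did_fit))["adopted_playbook:time", c("Estimate", "Std. Error")]
... # repeat for all classes, merge into all_did_effects (estimate, se)
metagen(all_did_effects$estimate, all_did_effects$se, comb.random = TRUE)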

Supplementary Note 4 Additional Information About Administration Timing
Due to logistical errors in communication between the intervention administration team and instructors, 137 students (1.1% of the 12,065) were accidentally given access to the Exam Playbook earlier than 10 days prior to their exams. Because the planned official release date was 10 days prior to the exam, and this was also the earliest time that the vast majority of students could access the Exam Playbook via ECoach, the main-text analyses use a truncated "time_left" variable whose values fall between 0 and 10 (i.e., any value above 10 was replaced with 10). Nevertheless, we also repeated this analysis without truncation (i.e., allowing values up to 15 days before the exam, the maximum time at which any student accessed the Exam Playbook). Consistent with the main findings, students who used the Exam Playbook benefitted more from using it earlier (b = 0.42 percentage points per day [0.29, 0.54], p < .001; compared to b = 0.42 percentage points per day in the truncated analysis reported in the main text).
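A minimal sketch of the truncation described above (the data object name is illustrative; "time_left" is the variable defined elsewhere in this supplement):
# Cap "time_left" at 10 days for the main-text analyses; the untruncated
# variable is retained for the robustness check.
exam_lvl$time_left_truncated <- pmin(exam_lvl$time_left, 10)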

Mixed-effects Hierarchical Linear Modelling
In our analyses in the main text, we used a mixed-effects meta-analysis model to aggregate the effect size estimates across the different classes, treating each class as a separate "experiment". We preferred this approach because we wanted to further examine heterogeneity across classes. An alternative approach is mixed-effects hierarchical linear modelling, where we treat students as nested within course and semester. Here, we report our results using this alternative estimation approach, implemented with the lme4 package (v1.1-26; Bates et al., 2015), and show that we can draw similar conclusions.

Effect of Using the Exam Playbook.
To estimate the effect of using the Exam Playbook, we used a dummy-coded variable indicating that a student used the Exam Playbook at least once throughout the semester (playbook_user) to predict their average exam score in the class. We added random effects by course and semester. (Note: We tried fitting a model with course nested within semester, but the model reported a singular fit, suggesting that the random-effect structure is over-fitted.) Specifically, we ran the following model:
lmer(avg_exam_score ~ playbook_user + (1|course) + (1|semester), data = user_lvl)
Consistent with the meta-analysis model, we found that students who used the Exam Playbook outperformed students who did not (b = 2.07 percentage points [1.51, 2.64], d = 0.11, p < .001; compared to 2.17 percentage points, d = 0.18, estimated by meta-analysis). As a further robustness check, we repeated this analysis at the exam level:
lmer(exam_score ~ used_playbook + (1|exam:course) + (1|student:course) + (1|semester), data = exam_lvl)
We found that students who used the Exam Playbook on a given exam performed better than students who did not (b = 2.94 percentage points [2.60, 3.28], d = 0.12, p < .001; compared to 2.91 percentage points, d = 0.22, estimated by meta-analysis).
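As an illustrative sketch (not necessarily the exact code used to produce the reported intervals), the fixed-effect estimate and a Wald 95% confidence interval for playbook_user can be extracted from the user-level model as follows; the object name "fit" is ours:
library(lme4)
fit <- lmer(avg_exam_score ~ playbook_user + (1|course) + (1|semester), data = user_lvl)
summary(fit)$coefficients["playbook_user", ]           # estimate, SE, t value
confint(fit, parm = "playbook_user", method = "Wald")  # Wald 95% confidence interval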

Dosage and Timing.
To estimate the dosage effect, we considered the subset of Exam Playbook users, and used the number of times they used the Exam Playbook to predict their average exam score in the class. We added random effects by course and semester:
lmer(avg_exam_score ~ sum_playbook_usage + (1|course) + (1|semester), data = playbook_users)
We found that among students who used the Exam Playbook, using the Exam Playbook on more occasions was related to better average exam performance (b = 3.33 percentage points [2.90, 3.76], d = 0.26, p < .001; compared to b = 2.18, d = 0.18, estimated via meta-analysis).
To estimate how the timing of usage affects exam performance, we again considered the subset of Exam Playbook users, but now examined performance on each individual exam. We defined a variable, "time_left", which counts the number of days between the Exam Playbook usage and the exam itself, and used it to predict students' exam scores. Because this analysis was at the exam level (which is nested within course and semester), we used the following random-effect structure:
lmer(exam_score ~ time_left + (1|exam:course:semester) + (1|course) + (1|semester), data = playbook_users_exam_level)
We found that students who used the Exam Playbook benefited more from using it earlier (b = 0.53 percentage points per day [95% CI: 0.46, 0.61], d = 0.39, p < .001; compared to b = 0.42, d = 0.03, estimated via meta-analysis).

R Code Corresponding to Main Text Analyses
Treatment effect of Exam Playbook. For each class, we predicted students' average exam performance using a binary predictor that indicated whether the student used the Exam Playbook at least once in the class. We then aggregated the estimates from the 14 individual models, weighting them using their standard errors.
playbook_effect <- lm(avg_exam ~ used_playbook, data = class_data)
... # repeat for all classes, merge into all_playbook_effects
meta::metagen(all_playbook_effects$estimates, all_playbook_effects$se, comb.random = TRUE)

Class Heterogeneity Analysis. We predicted the Exam Playbook effect size of each class using the proportion of Exam Playbook usage in the class (i.e., the proportion of students that used the Exam Playbook at least once, from 0 to 1), and a binary predictor indicating whether extra course credit was offered for using the Exam Playbook.
lm(playbook_effect_size ~ proportion_playbook_use + course_credit, data = all_classes)

Stratified Matching Analysis. We performed stratified matching using the MatchIt package (v4.2.0; Ho et al., 2011). This analysis first computes a propensity score by using the covariates (previous exam score, college entrance score, gender, race, and first-generation status) to predict the treatment group (e.g., adopted the Exam Playbook versus not) via logistic regression. It then stratifies the propensity scores based on quantiles. Based on these strata, the final regression model is weighted to give an estimate of the Average Treatment Effect (ATE). Below, usage_pattern is a categorical variable that could be "adopted" versus "not adopted", or "dropped" versus "not dropped".
matched_data = match.data(
  matchit(usage_pattern ~ E1_exam_score + college_entrance_score + gender + race + first_gen,
          data = class_data,
          method = "subclass", subclass = 5, # 5 strata
          estimand = "ATE"))
lm(E2_exam_score ~ usage_pattern + E1_exam_score + college_entrance_score + gender + race + first_gen,
   weights = weights, data = matched_data)

Dosage and Timing. We fit linear models for each class before estimating an aggregate effect using random-effects meta-analysis. To estimate the dosage effect, we considered the subset of Exam Playbook users, and used the number of times they used the Exam Playbook to predict their average exam score in the class.

lm(avg_exam_score ~ sum_playbook_usage, data = playbook_users)
To estimate how timing of usage affects exam performance, we again considered the subset of Exam Playbook users, but now examined performance on each individual exam. We defined a variable, "time_left," which counts the number of days between the Exam Playbook usage and the exam itself.
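A sketch of the corresponding per-class timing model, consistent with the description above (the data object name is illustrative):
lm(exam_score ~ time_left, data = playbook_users_exam_level)
... # repeat for all classes and pool the estimates with meta::metagen as above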

Moderation of Exam Playbook usage and effects.
To test for self-selection, we predicted whether a student used the Exam Playbook at least once in the course, using as predictors their college entrance scores, gender, race, and first-generation status. Similar to the previous analyses, this analysis was performed separately for each class and aggregated using random-effects meta-analysis.
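A sketch of this per-class self-selection model, for example as a logistic regression (the link function is our assumption; object names are illustrative), with per-class estimates pooled via meta::metagen as above:
glm(playbook_user ~ college_entrance_score + gender + race + first_gen,
    family = binomial, data = class_data)
... # repeat for all classes and aggregate the coefficients with meta::metagen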

Supplementary Note 8 Focus on the First Two Exams in Stratified Matching Analysis
As described in the main text, we performed the stratified matching analysis on Exam Playbook usage and exam performance for the first two exams, due to the steep drop-off in Exam Playbook usage after the first two exams across the majority of courses (with the exception of the introductory statistics course; see Table 1). This analytic decision mirrored the dosage in the original RCT, in which students were given the intervention at most twice (Chen et al., 2017). As background, the original intervention was developed for an introductory college statistics course with only two exams. Two doses were considered enough to convey the message without so much repetition that it would bore students. This translational study differed by providing as many doses as students voluntarily wanted to self-administer: for example, some classes allowed students to take the Exam Playbook before each of their three exams, so that students who preferred to use the Exam Playbook more than twice were given the opportunity to do so.
As the usage rates in Table 1 show, however, there was generally a steep drop in students' usage of the Exam Playbook after the first two exams across most courses (with the exception of the introductory statistics course which incentivized the use of the Exam Playbook on the third exam). This steep drop-off in use could mean different things: one, that students get tired of repeatedly seeing the same intervention content and no longer use it after two tries; or two, that students have understood and internalized the psychology of the intervention, and hence no longer need it further.
To elaborate on the first possibility: because each dose of the intervention has the same content, offering too many doses of the exact same intervention may bore students, resulting in declining usage rates beyond two doses. Future iterations of the Exam Playbook that plan to offer students more than two doses should examine this possibility (such as through student focus groups or user-experience interviews), and improve upon the intervention design.
The latter point on internalization is an important open empirical question that we are keen to investigate further in future research. Can students internalize and generalize the psychology of strategic resource use to their other classes and over the long term? Longitudinal research with students' course and performance data could follow up on these results to study how much students truly internalize and transfer the psychology of the Exam Playbook.